HOMEWORK 2 - Creating Value Through Data Mining (S402010)

Quick Start

Loading Packages

library(data.table) # Efficient Dataframe 
library(lubridate) # For Dates 
library(tidyverse) # Multiple Package for Useful Data wrangling
library(esquisse) # Intuitive plotting
library(plyr) # Data splitting
library(dplyr) # Data Wrangling
library(ggplot2) # Plot Graphs
library(naniar) # for NA exploration in Dataframe
library(plotly) # Make ggplot2 Dynamic
library(gridExtra) # Multiple Plot at once
library(RColorBrewer) # For Color Palette
library(rmdformats) # Theme of HTML
library(flextable) # Show Table
library(class) # K-NN
library(summarytools) # Beautiful and Efficient Summary for Dataset
library(pivottabler) # Pivot Table
library(naivebayes) # Naive Bayes Function
library(caret) # Confusion Matrix

Those are required packages

Exercise 1.

5.6 Evaluating Predictive Performance

a.

If the company beings working with a new set of 1000 leads to sell the same services, similar to the 500 in the plot study, witout any use of predictive modeling to target sales efforts, what is the estimated profit?


Without any predictive modeling, we can roughly estimated the profit with the following formula:

\[ Sales_{Estimated} = 1000*\$2500 = \$250'000 \]


But the company will also have expenditures related to their sales, which would negatively impact the Total Profit. The sales effort would be:

\[ Costs_{Estimated} = 1000*\$2500 = \$250'0000 \]


Leading to Total Profit of…

\[ TotalProfit_{Estimated} = Sales_{Estimated} - Costs_{Estimated} \\ = \$250'000 - \$250'000 = 0 \]


b.

If the firm wants the average profit on each sale to at least double the sales effort cost, and applies an appopriate cutoff with this predictive model to a new set of 1000 leads, how far down the new list of 1000 should it proceed (how many deciles)?

\[ Ratio_{Estimated} = \dfrac{2*\$2500}{\$2500} = 2 \]

If we want to double the average profit on each sale, we should take the first decile (10%) on the Decile-wise lift chart which double the mean-response.


c.

Still considering the new list of 1000 leads, if the company applies this predictive model with a lower cutoff of $2500, how far should it proceed down the ranked leads, in terms of deciles?

\[ Ratio_{Estimated} = \dfrac{\$2500}{\$2500} = 1 \]

If we take a look at the Decile-wise lift chart, we see that until the 6th decile we would get a mean-response of 1 for all those previously included deciles. Thus we take all of them until the 6th decile.


d. 

Why use this two-stage process for predicting sales–why not simply develop a model for predicting profit for the 1000 new leads?

This two-stage process for predicting sales already give possible answer to the managers for the 1000 leads predictions and which target customers they should choose for achieving such goals. Those 2 graphs help getting fast insights to the sales predictions and what to do for achieving the goals required by the board or company management.

Exercise 2.

7.2 Personnal Loan Acceptance

Loading Dataset and Basic Data Quality Check

# Loading the Universal Bank Dataset into R
UniversalBank1 <- fread("DATA/UniversalBank.csv")

# Quick insight of the table look
flextable(head(UniversalBank1)) %>% 
  fontsize(size = 7, part = "all")

# Dataset Dimensions
dim(UniversalBank1)

# Variables Count
sapply(UniversalBank1, function(x) length(unique(x)))

# Dataset Type and Structure
str(UniversalBank1)

Summary of Universal.csv Dataset

# Load SummaryTools
library(summarytools)

# Required command by manual
st_options(plain.ascii = FALSE, style = "grid")

# Print the Summary to HTML
print(dfSummary(UniversalBank1, varnumbers = FALSE, 
                valid.col    = FALSE, 
                graph.magnif = 0.75), method="render")

Data Frame Summary

UniversalBank1

Dimensions: 5000 x 14
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
ID [integer]
Mean (sd) : 2500.5 (1443.5)
min ≤ med ≤ max:
1 ≤ 2500.5 ≤ 5000
IQR (CV) : 2499.5 (0.6)
5000 distinct values (Integer sequence) 0 (0.0%)
Age [integer]
Mean (sd) : 45.3 (11.5)
min ≤ med ≤ max:
23 ≤ 45 ≤ 67
IQR (CV) : 20 (0.3)
45 distinct values 0 (0.0%)
Experience [integer]
Mean (sd) : 20.1 (11.5)
min ≤ med ≤ max:
-3 ≤ 20 ≤ 43
IQR (CV) : 20 (0.6)
47 distinct values 0 (0.0%)
Income [integer]
Mean (sd) : 73.8 (46)
min ≤ med ≤ max:
8 ≤ 64 ≤ 224
IQR (CV) : 59 (0.6)
162 distinct values 0 (0.0%)
ZIP Code [integer]
Mean (sd) : 93152.5 (2121.9)
min ≤ med ≤ max:
9307 ≤ 93437 ≤ 96651
IQR (CV) : 2697 (0)
467 distinct values 0 (0.0%)
Family [integer]
Mean (sd) : 2.4 (1.1)
min ≤ med ≤ max:
1 ≤ 2 ≤ 4
IQR (CV) : 2 (0.5)
1:1472(29.4%)
2:1296(25.9%)
3:1010(20.2%)
4:1222(24.4%)
0 (0.0%)
CCAvg [numeric]
Mean (sd) : 1.9 (1.7)
min ≤ med ≤ max:
0 ≤ 1.5 ≤ 10
IQR (CV) : 1.8 (0.9)
108 distinct values 0 (0.0%)
Education [integer]
Mean (sd) : 1.9 (0.8)
min ≤ med ≤ max:
1 ≤ 2 ≤ 3
IQR (CV) : 2 (0.4)
1:2096(41.9%)
2:1403(28.1%)
3:1501(30.0%)
0 (0.0%)
Mortgage [integer]
Mean (sd) : 56.5 (101.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 635
IQR (CV) : 101 (1.8)
347 distinct values 0 (0.0%)
Personal Loan [integer]
Min : 0
Mean : 0.1
Max : 1
0:4520(90.4%)
1:480(9.6%)
0 (0.0%)
Securities Account [integer]
Min : 0
Mean : 0.1
Max : 1
0:4478(89.6%)
1:522(10.4%)
0 (0.0%)
CD Account [integer]
Min : 0
Mean : 0.1
Max : 1
0:4698(94.0%)
1:302(6.0%)
0 (0.0%)
Online [integer]
Min : 0
Mean : 0.6
Max : 1
0:2016(40.3%)
1:2984(59.7%)
0 (0.0%)
CreditCard [integer]
Min : 0
Mean : 0.3
Max : 1
0:3530(70.6%)
1:1470(29.4%)
0 (0.0%)

Generated by summarytools 1.0.2 (R version 4.2.1)
2022-10-31

We can see that most of the variables are of type “integer” except CCAvg being of type “numeric”. We have no missing datas in all variables. We can see this better with a plot:

Partition The Data Into Training (60%) and Validation (40%) Sets

The Following Code Set Seed to 1 and partition the dataset in 2 sets, training and validation.

# Setting Seed
set.seed(1)

# Splitting Training and Validation 
sample <- sample(c(TRUE, FALSE), nrow(UniversalBank1), replace=TRUE, prob=c(0.6,0.4))
training  <- UniversalBank1[sample, ]
validation   <- UniversalBank1[!sample, ]

# Checking if proportions are right
train_prop <- dim(training)
validation_prop <- dim(validation)

train_prop_100 <- (train_prop[1]/nrow(UniversalBank1))*100
validation_prop_100 <- (validation_prop[1]/nrow(UniversalBank1))*100

paste(train_prop_100,"% In Training",validation_prop_100,"% In Validation")

[1] “61 % In Training 39 % In Validation”

Here is the confirmation of the effective pourcentages of each set category after the partition process.

a. Considering the following customer:

Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1.

Perform a Κ-NN Classification with all predictors except ID and ZIP code using Κ = 1

# Setting Seed
set.seed(1)

# Removing Some Predictors
training <- training[,-c("ID","ZIP Code")]
validation <- validation[,-c("ID","ZIP Code")]

# Target Variable As Factor
training$`Personal Loan` <- factor(training$`Personal Loan`, levels = c(0,1),labels = c("No Loan","Loan")) 
validation$`Personal Loan` <- factor(validation$`Personal Loan`, levels = c(0,1),labels = c("No Loan","Loan")) 

# Education As Factor
training$Education <- as.factor(training$Education) 
validation$Education <- as.factor(validation$Education) 

# Education One-Hot Encoding
Education_As_Dummy_Training <- model.matrix(~0+training$Education)
Education_As_Dummy_Validation <- model.matrix(~0+validation$Education)

# Append to Training and Validation Sets
training <- cbind(training,Education_As_Dummy_Training)
training <- training[,-c("Education")]

validation <- cbind(validation,Education_As_Dummy_Validation)
validation <- validation[,-c("Education")]

# Renaming Education
training = training %>% rename( Education_1 = `training$Education1` , Education_2 = `training$Education2`, Education_3 = `training$Education3`)
validation = validation %>% rename( Education_1 = `validation$Education1` , Education_2 = `validation$Education2`, Education_3 = `validation$Education3`)

# Preprocess for Data Normalization

training_norm <- training
validation_norm <- validation

training_norm_s <- training[,-c("Personal Loan")]

norm_values <- preProcess(training_norm_s,method = c("center","scale"))

training_norm <- predict(norm_values,training)
validation_norm <- predict(norm_values,validation)

# KNN Model using class package
library(class)

# Data frame for a specific customer not in Data
Customer_Test <- data.frame("Age"=40,"Experience"=10,"Income"=84,"Family"=2,"CCAvg"=2,"Mortgage"=0,"Securities Account"=0,"CD Account"=0,"Online"=1,"CreditCard"=1,"Education_1"=0,"Education_2"=1,"Education_3"=0, check.names=FALSE)

# Preprocess the Customer New Data
Customer_Test_norm <- predict(norm_values, Customer_Test)

## KNN Training for Customer
predictions_customer <- knn(train=training_norm[,-c("Personal Loan")],test = Customer_Test_norm, cl = training_norm$`Personal Loan`, k=1)

# Append Predictions to Customer not in Data
Customer_Test$Predicted <- predictions_customer

Data of the Customer we want to predict after Normalization

# Table Customer after Normalization
flextable(head(Customer_Test_norm)) %>% 
  fontsize(size = 7, part = "all")

Data of the Customer we want to predict after Prediction

# Table Customer after Predictions
flextable(head(Customer_Test)) %>% 
  fontsize(size = 7, part = "all")

This Customer would be classified as not getting a Personnal Loan (No Loan) by K-NN with K=1.

b. What is the Choice of Κ that balances between overfitting and ignoring the predictor information?

We would to test multiple K such that the best accuracy would be chosen between the training and validations set (cross validation procedure)

# Setting Seed
set.seed(1)

# Load Caret 
library(caret)

# Number of iterations
max_iterations = 30

# Dataframe with 2 columns: k and accuracy
accuracy.df <- data.frame(k=seq(1,max_iterations,1),accuracy=rep(0,max_iterations))

# Compute K-NN for different k on validation
for(i in 1:max_iterations){
  # Testing K-NN
  knn.prediction <- knn(train = training_norm[,-c("Personal Loan")], test=validation_norm[,-c("Personal Loan")] , cl=training_norm$`Personal Loan`, k=i)
  # Storing into the accuracy.df results
  accuracy.df[i,2] <- confusionMatrix(knn.prediction, validation$`Personal Loan`)$overall[1]
}

# Table of Accuracy
flextable(accuracy.df) %>% fontsize(size = 12, part = "all")
# Ploting the K and accuracy together
ggplotly(
ggplot(accuracy.df) +
 aes(x = k, y = accuracy) +
 geom_line(size = 0.5, colour = "#1c6155") +
 labs(x = "Number of K", 
 y = "Accuracy (Between Training and Validation)", title = "K-NN Accuracy regarding parameter K") +
 theme_minimal()
)
# Choosing Efficient K
highest_K <- which.max(accuracy.df$accuracy)

print(paste("Best K for Highest Accuracy is",highest_K))

[1] “Best K for Highest Accuracy is 1”

We can see that the highest the K, the less is the accuracy of the model is through cross validation.

c. Show the confusion matrix for the validation data that results from using best Κ

# Setting Seed
set.seed(1)

# Computing Confusion Matrix with Best K
predictions_k <- knn(train=training_norm[,-c("Personal Loan")],test = validation_norm[,-c("Personal Loan")], cl = training_norm$`Personal Loan`, highest_K)

# Confusion Matrix
Confusion_Matrix_k <- confusionMatrix(data = predictions_k, reference = validation$`Personal Loan`)

# Plotting Matrix Function (In the References)
draw_confusion_matrix <- function(cm) {

  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX', cex.main=2)

  # create the matrix 
  rect(150, 430, 240, 370, col='#1c6155')
  text(195, 435, 'No Loan', cex=1.2)
  rect(250, 430, 340, 370, col='#1c615570')
  text(295, 435, 'Loan', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col='#1c615570')
  rect(250, 305, 340, 365, col='#1c6155')
  text(140, 400, 'No Loan', cex=1.2, srt=90)
  text(140, 335, 'Loan', cex=1.2, srt=90)

  # add in the cm results 
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}

#Plot the Confusion Matrix
draw_confusion_matrix(Confusion_Matrix_k)

This Confusion Matrix has a Accuracy of 0.968 and Specifity of 0.738, lower than the Sensitivity. Kappa is equal to 0.793. Among all Confusion Matrix, this one has the highest Specifity and Kappa. This Confusion Matrix has the highest F1 but really close to the last Confusion Matrix. Since the Specifity is the True negative rate, we can see that this model is not that good at predicting the Loan when it is actually a Loan. We can see that 48 customers were predicted no Loan while they actually had Loan, thus the following ratio:

\[ \dfrac{135}{135+48} = 0.738 = Specificity = True \ \ Negative \ \ Rate \]

d. Consider the following customer:

Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1.

# Setting Seed
set.seed(1)

# KNN Model on a specific customer not in Data
Customer_Test_2 <- data.frame("Age"=40,"Experience"=10,"Income"=84,"Family"=2,"CCAvg"=2,"Mortgage"=0,"Securities Account"=0,"CD Account"=0,"Online"=1,"CreditCard"=1,"Education_1"=0,"Education_2"=1,"Education_3"=0, check.names=FALSE)

# Preprocess the Customer New Data
Customer_Test_2_norm <- predict(norm_values, Customer_Test_2)

## KNN Training for Customer
predictions_customer_2 <- knn(train=training_norm[,-c("Personal Loan")],test = Customer_Test_2_norm, cl = training_norm$`Personal Loan`, k=highest_K)

# Append Predictions to Customer not in Data
Customer_Test_2$Predicted <- predictions_customer_2
# Table Customer after Predictions
flextable(head(Customer_Test_2)) %>%
  fontsize(size = 7, part = "all") 

This Customer would also be classified as not getting a Loan (No Loan).

e. Repartition the data (50%-30%-20%)

Now we want to partition the data in the following proportions: 50% in Training, 30% in Validation and 20% in Testing.

# Setting Seed
set.seed(1)

# Splitting Training and Validation and Test
splitting <- sample(1:3,size=nrow(UniversalBank1),replace=TRUE,prob=c(0.5,0.3,0.2))
train <- UniversalBank1[splitting==1,]
valid <- UniversalBank1[splitting==2,]
test <- UniversalBank1[splitting==3,]

# Checking if proportions are right
Prop_train <- (nrow(train)/nrow(UniversalBank1))*100
Prop_valid <- (nrow(valid)/nrow(UniversalBank1))*100
Prop_test <- (nrow(test)/nrow(UniversalBank1))*100

# Print Proportion
paste(Prop_train,"% In Training",Prop_valid,"% In Validation",Prop_test,"% In Test")
## [1] "51.28 % In Training 27.74 % In Validation 20.98 % In Test"

We check the partition result.

Now we prepare the data sets and normalize them before making predictions with K-NN

# Setting Seed
set.seed(1)

# Removing Some Predictors
train <- train[,-c("ID","ZIP Code")]
valid <- valid[,-c("ID","ZIP Code")]
test <- test[,-c("ID","ZIP Code")]

# Target Variable As Factor
train$`Personal Loan` <- factor(train$`Personal Loan`, levels = c(0,1), labels=c("No Loan","Loan")) 
valid$`Personal Loan` <- factor(valid$`Personal Loan`,levels = c(0,1), labels=c("No Loan","Loan")) 
test$`Personal Loan` <- factor(test$`Personal Loan`,levels = c(0,1), labels=c("No Loan","Loan")) 

# Education As Factor
train$Education <- as.factor(train$Education) 
valid$Education <- as.factor(valid$Education) 
test$Education <- as.factor(test$Education) 

# Education One-Hot Encoding
Education_As_Dummy_Train <- model.matrix(~0+train$Education)
Education_As_Dummy_Valid <- model.matrix(~0+valid$Education)
Education_As_Dummy_Test <- model.matrix(~0+test$Education)

# Append to Training and Validation Sets
train <- cbind(train,Education_As_Dummy_Train)
train <- train[,-c("Education")]

valid <- cbind(valid,Education_As_Dummy_Valid)
valid <- valid[,-c("Education")]

test <- cbind(test,Education_As_Dummy_Test)
test <- test[,-c("Education")]

# Renaming Education
train = train %>% rename( Education_1 = `train$Education1` , Education_2 = `train$Education2`, Education_3 = `train$Education3`)
valid = valid %>% rename( Education_1 = `valid$Education1` , Education_2 = `valid$Education2`, Education_3 = `valid$Education3`)
test = test %>% rename( Education_1 = `test$Education1` , Education_2 = `test$Education2`, Education_3 = `test$Education3`)

# Preprocess for Data Normalization
library(caret)

train_norm <- train
validn_norm <- valid
test_norm <- test

train_norm_s <- train[,-c("Personal Loan")]

norm_values_2 <- preProcess(train_norm_s,method = c("center","scale"))

train_norm <- predict(norm_values_2,train)
valid_norm <- predict(norm_values_2,valid)
test_norm <- predict(norm_values_2,test)

Confusion Matrix for Train VS Valid

# Train VS Valid

# Setting Seed
set.seed(1)

# Computing Confusion Matrix with Best K
predictions_k_new <- knn(train=train_norm[,-c("Personal Loan")],test = valid_norm[,-c("Personal Loan")], cl = train_norm$`Personal Loan`, highest_K)

# As Factor Predictions
predictions_k_new <- as.factor(predictions_k_new)

# Confusion Matrix
Confusion_Matrix_k_New <- confusionMatrix(data = predictions_k_new, reference = valid$`Personal Loan`)

#Plot the Confusion Matrix
draw_confusion_matrix(Confusion_Matrix_k_New)

Specificity is 0.669 which is lower than the Sensitivity (0.99), doing bad predictions for the True Negative (Loan) and worse than the first Confusion Matrix (0.669<0.738).

\[ \dfrac{87}{87+43} = 0.669 = Specificity = True \ \ Negative \ \ Rate \]

Confusion Matrix for Train VS Test

# Train VS Test

# Setting Seed
set.seed(1)

# Computing Confusion Matrix with Best K
predictions_k_new2 <- knn(train=train_norm[,-c("Personal Loan")],test = test_norm[,-c("Personal Loan")], cl = train_norm$`Personal Loan`, highest_K)

# As Factor Predictions
predictions_k_new2 <- as.factor(predictions_k_new2)

# Confusion Matrix
Confusion_Matrix_k_New2 <- confusionMatrix(data = predictions_k_new2, reference = test$`Personal Loan`)

#Plot the Confusion Matrix
draw_confusion_matrix(Confusion_Matrix_k_New2)

The Specificity is higher (0.716) than the previous Confusion Matrix (0.669 - Train VS Valid), but still lower than the Sensitivity. The Accuracy is a bit higher 0.965 > 0.96 than the previous Confusion Matrix (Train VS Valid). The Specifity is still lower than our first Confusion Matrix (0.716<0.738). There is more agreement in this Confusion Matrix than the previous one (Kappa => 0.767 > 0.735).

\[ \dfrac{68}{68+27} = 0.716 = Specificity = True \ \ Negative \ \ Rate \]

Since all Confusion Matrices give a bad Specificity Rate (lower than 0.8), I would not rely on this model too much when it comes to predicting the Customer getting Loan. This K-NN model give more “No Loan” correctly than the opposite.

Exercise 3

8.1 Personal Loan Acceptance

Partition the data into training (60%) and validation (40%) sets

UniversalBank2 <- fread("DATA/UniversalBank.csv")

# Setting Seed
set.seed(1)

# Splitting Training and Validation 
sample2 <- sample(c(TRUE, FALSE), nrow(UniversalBank2), replace=TRUE, prob=c(0.6,0.4))
training_8  <- UniversalBank2[sample2, ]
validation_8   <- UniversalBank2[!sample2, ]

# Checking if proportions are right
training_8_prop <- (nrow(training_8)/nrow(UniversalBank2))*100
validation_8_prop <- (nrow(validation_8)/nrow(UniversalBank2))*100

paste(training_8_prop,"% In Training",validation_8_prop,"% In Validation")

[1] “61 % In Training 39 % In Validation”

We can check the partition after the sampling code.

a) Pivot Table

# Duplicata of Training Data for Pivot Data
pivot_data <- training_8

# As Factor
pivot_data$Online <- factor(pivot_data$Online,levels = c(0,1),labels=c("Inactive Online","Active Online"))
pivot_data$CreditCard <- factor(pivot_data$CreditCard,levels = c(0,1),labels=c("No Credit Card","Credit Card"))
pivot_data$`Personal Loan` <- factor(pivot_data$`Personal Loan`,levels = c(0,1),labels=c("No Personal Loan","Personal Loan"))

# Pivot Table
pt <- PivotTable$new()
pt$addData(pivot_data) 
pt$addColumnDataGroups("Online")
pt$addRowDataGroups("CreditCard")
pt$addRowDataGroups("Personal Loan")
pt$defineCalculation(calculationName="Total", summariseExpression="n()")
pt$renderPivot()

Here is the first pivot table with Loan and Credit Card as rows, and Online as columns.

b) Probability of a Customer who owns a bank credit card and is actively using online banking services would accept the loan offer?

Using Bayes Theorem

\[\small P(Loan=1 | CC=1 \cap Online=1) = \\ \small \dfrac{54}{506+54} =\dfrac{54}{560} = 0.09642857 = 9.64 \%\]

Thus, there is 9.64% probability that this kind of customer would accept the loan offer.

c) Pivot Table in 2 Versions

# Pivot Table 1
pt1 <- PivotTable$new()
pt1$addData(pivot_data) 
pt1$addColumnDataGroups("Online")
pt1$addRowDataGroups("Personal Loan")
pt1$defineCalculation(calculationName="Total", summariseExpression="n()")
pt1$renderPivot()
# Pivot Table 1
pt2 <- PivotTable$new()
pt2$addData(pivot_data) 
pt2$addColumnDataGroups("CreditCard")
pt2$addRowDataGroups("Personal Loan")
pt2$defineCalculation(calculationName="Total", summariseExpression="n()")
pt2$renderPivot()

Here are the two pivot table, each for looking at Credit Card VS Loan or Online VS Loan

d) Compute the following quantities

i.

\[\small P (CC=1 | Loan=1) = \\ \small \dfrac{P(Loan=1|CC=1)*P(CC=1)}{P(Loan=1)} = \\ \small 0.3198653 = 31.99 \%\]

ii.

\[\small P (Online=1 | Loan=1) = \\ \small \dfrac{P(Loan=1|Online=1)*P(Online=1)}{P(Loan=1)} = \\ \small \dfrac{188}{297}= 0.6329966 = 63.30 \%\]

iii.

\[\small P (Loan=1) = \dfrac{297}{3050} = 0.09737705 = 9.74 \% \]

iv.

\[\small P (CC=1 | Loan=0) = \\ \small \dfrac{P(Loan=0|CC=1)*P(CC=1)}{P(Loan=0)} = \\ \small \dfrac{827}{2753} = 0.3003996 = 30.04 \%\]

v.

\[\small P (Online=1 | Loan=0) = \\ \small \dfrac{P(Loan=0|Online=1)*P(Online=1)}{P(Loan=0)} =\\ \small \dfrac{1626}{2753} = 0.5906284 = 59.06 \%\]

vi.

\[\small P (Loan=0) = \dfrac{2753}{3050} = 0.902623 = 90.26\%\]

e) Compute naive Bayes Probability

\[\small P(Loan=1|CC=1,Online=1)\] Using the naive Bayes Probability give us the following computation:

\[\small P(Loan=1|CC=1,Online=1) = \\ \small \dfrac{P(Loan=1)*P(CC=1|Loan=1)*P(Online=1|Loan=1)}{P(CC=1)*P(Online=1)}=\\ \small \dfrac{0.09737705*0.3198653*0.6329966}{0.3022951*0.5947541} = 0.1096621 = 10.97\% \]

f) Accurate Estimate

The Naive approach give us 10.97%, while the Complete Bayes probability give us 9.64%

g) Run Model and Comparisons

\[P(Loan=1|CC=1,Online=1)\]

# As factor for Loan
training_8$Online <- factor(training_8$Online,levels = c(0,1),labels=c("Inactive Online","Active Online"))
training_8$CreditCard <- factor(training_8$CreditCard,levels = c(0,1),labels=c("No Credit Card","Credit Card"))
training_8$`Personal Loan` <- factor(training_8$`Personal Loan`,levels = c(0,1),labels=c("No Personal Loan","Personal Loan"))

Naivebayes <-  naive_bayes(training_8$`Personal Loan` ~ training_8$CreditCard + training_8$Online, data=training_8)

summary(Naivebayes)
## 
## ================================== Naive Bayes ================================== 
##  
## - Call: naive_bayes.formula(formula = training_8$`Personal Loan` ~ training_8$CreditCard +      training_8$Online, data = training_8) 
## - Laplace: 0 
## - Classes: 2 
## - Samples: 3050 
## - Features: 2 
## - Conditional distributions: 
##     - Bernoulli: 2
## - Prior probabilities: 
##     - No Personal Loan: 0.9026
##     - Personal Loan: 0.0974
## 
## ---------------------------------------------------------------------------------
Naivebayes
## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.formula(formula = training_8$`Personal Loan` ~ training_8$CreditCard + 
##     training_8$Online, data = training_8)
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
## No Personal Loan    Personal Loan 
##       0.90262295       0.09737705 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: training_8$CreditCard (Bernoulli) 
## --------------------------------------------------------------------------------- 
##                      
## training_8$CreditCard No Personal Loan Personal Loan
##        No Credit Card        0.6996004     0.6801347
##        Credit Card           0.3003996     0.3198653
## 
## --------------------------------------------------------------------------------- 
##  ::: training_8$Online (Bernoulli) 
## --------------------------------------------------------------------------------- 
##                  
## training_8$Online No Personal Loan Personal Loan
##   Inactive Online        0.4093716     0.3670034
##   Active Online          0.5906284     0.6329966
## 
## ---------------------------------------------------------------------------------

\[\small P(Loan=1|CC=1,Online=1) = \\ \small \dfrac{P(Loan=1)*P(CC=1|Loan=1)*P(Online=1|Loan=1)}{P(CC=1)*P(Online=1)}=\\ \small \dfrac{0.09737705*0.3198653*0.6329966}{0.3022951*0.5947541} = 0.1096621 = 10.97\% \]

The Naive Model and above question e) should give us the same estimate probabilities. We can see a difference in the Naive approach and Complete Bayes Probabilities. Naive Bayes assumes conditional independence where Complete Bayes theorem does not, this is why there is such a difference in estimates, but such strong conditional independence may not be relevant everytime. For instance, having a credit card may be correlated to being online such that the credit card gives some special access to online services that the customer wanted, thus the buy of this credit card. The naive approach is faster to compute, but the Bayesian Network Classifiers would be perhaps better since you can specify which features could be independent or not.

References

Github Repo for this Homework 2

Data Mining for Business Analytics: Concepts, Techniques, and Applications in R

How to Split Data into Training & Test Sets in R (3 Methods)

R how to visualize confusion matrix using the caret package

Friedman, N., Geiger, D., Provan, G., Langley, P. and Smyth, P. (1997). Bayesian Network Classifiers